Document Similarity Misjudgment by LSA: Misses vs. False Positives
Abstract
Modeling text document similarity is an important yet challenging task. Even the most advanced computational linguistic models often misjudge document similarity relative to humans. Regarding the pattern of misjudgment between models and humans, Lee and colleagues (2005) suggested that the models’ primary failure is occasional underestimation of strong similarity between documents. According to this suggestion, there should be more extreme misses (i.e., models failing to pick up on strong document similarity) than extreme false positives (i.e., models falsely detecting document similarity that does not exist). We tested this claim by comparing document similarity ratings generated by humans and latent semantic analysis (LSA). Notably, we implemented LSA with 441 unique parameter settings, determined optimal parameters that yielded high correlations with human ratings, and finally identified misses and false positives under the optimal parameter settings. The results showed that, as Lee et al. predicted, large errors were predominantly misses rather than false positives. Potential causes of the misses and false positives are discussed.
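The comparison described in the abstract can be illustrated with a minimal sketch. It assumes one human rating and one model similarity score per document pair, rescales both to [0, 1], and flags pairs where the two differ by at least a cutoff. The function names, the min-max rescaling, the 0.5 cutoff, and the toy numbers are all hypothetical choices made for illustration; the abstract does not specify how extreme errors were defined.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def min_max(values):
    """Rescale values to [0, 1] (assumed normalization, not from the source)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def classify_errors(human, model, threshold=0.5):
    """Return indices of extreme misses and extreme false positives.

    A miss: humans rate the pair far more similar than the model does.
    A false positive: the model rates the pair far more similar than humans do.
    The threshold is a hypothetical cutoff on the rescaled difference.
    """
    h, m = min_max(human), min_max(model)
    misses = [i for i, (a, b) in enumerate(zip(h, m)) if a - b >= threshold]
    false_positives = [i for i, (a, b) in enumerate(zip(h, m)) if b - a >= threshold]
    return misses, false_positives


# Toy data: five document pairs (made-up numbers, not from the study).
human = [1.0, 5.0, 4.5, 1.5, 3.0]        # human similarity ratings
model = [0.10, 0.90, 0.15, 0.20, 0.50]   # model similarity scores (e.g., cosines)

r = pearson(human, model)
misses, false_positives = classify_errors(human, model)
print(f"correlation with human ratings: {r:.2f}")
print("misses:", misses)
print("false positives:", false_positives)
```

In this toy data, the third pair is rated highly similar by humans but receives a low model score, so it is the only pair flagged, and it is flagged as a miss. With real data, the same routine could be run for each parameter setting, using the correlation to pick the settings under which errors are then examined.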
Similar resources
Information Retrieval Perspective to Interactive Data Visualization
Dimensionality reduction for data visualization has recently been formulated as an information retrieval task with a well-defined objective function. The formulation was based on preserving similarity relationships defined by a metric in the input space, and explicitly revealed the need for a tradeoff between avoiding false neighbors and missing neighbors on the low-dimensional display. In the ...
The Neural Mechanism of Encountering Misjudgment by the Justice System
Although misjudgment is an issue of primary concern to the justice system and public safety, the human brain's response to misjudgment remains unclear. We used fMRI to record neural activity in participants who encountered four possible judgments by the justice system with two basic components: whether the judgment was right or wrong [accuracy: right vs. wrong (misjudgment)] and whether t...
An index-based algorithm for fast on-line query processing of latent semantic analysis
Latent Semantic Analysis (LSA) is widely used for finding documents that are semantically similar to a keyword query. Although LSA yields promising results, existing LSA algorithms involve many unnecessary operations in similarity computation and candidate checking during on-line query processing, which is expensive in terms of time and cannot efficiently respond to the quer...
Predicting False Positives of Protein-Protein Interaction Data by Semantic Similarity Measures
Recent technical advances in identifying protein-protein interactions (PPIs) have generated genome-wide interaction data, collectively referred to as the interactome. These interaction data give insight into the underlying mechanisms of biological processes. However, the PPI data determined by experimental and computational methods include an extremely large number of false...
Implications of Memory Research for Criminal Law Procedure
Figure 7. Percentage of hits, false positives, and misses as a function of preceding task. Inspection of the graphs in Figure 7 shows an absence of any consistent pattern. Prior description had no effect on the identification of the philosophy ...